Don't They All Measure the Same Thing?
Consequences of Selecting Standardized Tests^1



Robert E. Floden, 
Andrew C. Porter, William H. Schmidt, 
and Donald J. Freeman^2

Recently, several authors have called attention to the importance
of consistency between the content of instruction and the content of
tests used to assess instruction (e.g., Walker & Schaffarzick, 1974;
Wiley, Note 1; Porter, Schmidt, Floden, & Freeman, Note 2). If test
content does not match instructional content, test results might
reflect a distorted picture of the effects of instruction. Many tests, it
has been found, cover only a fraction of the content presented in many
elementary classrooms. The major standardized achievement tests
for elementary schools, for example, focus on basic skills, and, as a
result, the test scores reflect only achievement in basic skills,
rather than achievement in the total instructional content covered.

Many people acknowledge that standardized test scores do not
reflect instruction in content outside the basic skills, but it has
seldom been recognized that even the definition of basic skills may
vary from test to test. In fact, the tests do not all measure the same
content knowledge, despite the prevailing opinion to the contrary (e.g.,
Cooley & Lohnes, 1976). This suggests that particular attention should
be paid to the content of the tests chosen, to ensure that test results
reflect achievement on the content considered important in a local
school district.

^1 This paper was presented at the UCLA Center for the Study of
Education Winter Conference on Measurement and Methodology, 1978.

^2 The authors are senior researchers in the Institute for Research
on Teaching and members of IRT's External Factors Research Group.

While the effect of test content on test scores has been recognized,
the effect of test content on instruction has received little attention.
Although some people believe that the initiation of a testing program may
have global effects on instruction (leading perhaps to greater emphasis on
the basic skills), more subtle shifts in instructional content brought on
by use of particular tests are only beginning to be investigated. These
and other influences on a teacher's selection of instructional content
deserve careful investigation.

District Test Use 
Many school district administrators are concerned with raising
student achievement in the basic skills. This concern is reflected by
such administrative policies as constructing lists of objectives, meeting
with teachers in workshops to discuss goals and methods, rewarding school
building administrators for improved performance in their schools, and
testing student progress on a regular basis. Although these actions are
frequently coupled with the development of criterion-referenced tests,
nearly all districts continue to administer norm-referenced standardized
achievement tests to assess improvement within the district and to
compare their district performance with national norms.

Little is known about the criteria administrators use to select a
test series. Factors considered probably include cost, ease of
administration, ease of scoring, and reporting format. One factor which
does not receive enough attention is the specific content covered by each
subtest. School personnel have been led to believe that the major test
series differ little in terms of the topics they test (at least for
subtests with similar titles). High correlations among the tests,
together with apparent unidimensionality of each subtest, have suggested
that the tests all measure the same things. If that were the case, the
content covered by a given test could not be a test-selection criterion,
since that content would be almost identical to the content covered by
other tests. In this paper, we argue that there are differences in
the content covered by the major tests and that those differences have
consequences, not only for assessing district progress, but also for the
content of instruction in the district's classrooms.

^3 The terms criterion-referenced and norm-referenced tests have taken
on a variety of meanings. In this context, we define criterion-referenced
tests as those in which an individual is assessed relative to a certain
standard, whereas norm-referenced tests assess the individual's performance
relative to other individuals or to a group average.

Test Analyses

Four test series dominate the market in elementary reading and
mathematics. These series are: the Stanford Achievement Tests (SAT),
the Iowa Tests of Basic Skills (Iowa), the Metropolitan Achievement Tests
(MAT), and the CTB/McGraw-Hill Comprehensive Tests of Basic Skills (CTBS).
Each test series is composed of tests designed for specific levels of
achievement, roughly corresponding to grade levels. Tests at each level
consist of several subtests, broadly grouped into areas covering different
reading and mathematics skills. In mathematics, for example, the
subtests typically include "Mathematics Applications," "Computation," and
"Concepts."

We at the Institute for Research on Teaching (IRT) recently examined
the fourth-grade mathematics content covered by the four test series.
We used an iterative process of analysis and classification of items on
the SAT to develop a taxonomy for describing the tests. A complete
presentation of the taxonomy is given in Figure 1; each cell in the
classification matrix corresponds to a topic that a teacher might elect
to cover. (The process of test description has been described elsewhere
in greater detail [Porter et al., Note 2; Schmidt, Porter, Floden, &
Freeman, Note 3].)

Comparisons of the four tests, detailed in Table 1, indicate that
although they are quite similar in some respects, they also have
striking differences. On the "operations" dimension, the tests
corresponded quite closely in the percentages of items involving "subtract
without borrowing" (6-8%), "add or subtract fractions without a
common denominator" (0-2%), and "divide with remainder" (1%). For the
other levels, however, differences were apparent. Twenty-one percent of
the items on the MAT, for example, involved addition, while the
corresponding figures for the Iowa, SAT, and CTBS were only 12, 13, and
14%, respectively. The Iowa had at least 5 percentage points fewer
multiplication items than the other tests. Grouping was tested by the
SAT but not by the MAT or the CTBS.

With respect to the nature of the material, more similarities than
differences were found, but the differences may be quite significant.
Six percent of the items on the CTBS, for example, involved percentage,
while the Iowa and MAT had no such items. The MAT and SAT both included
items on alternative number systems, while the Iowa and CTBS did not.



^4 Copyright dates of the tests which were analyzed are as follows:
SAT - 1973; Iowa - 1971; MAT - 1970; CTBS - 1968.
Table 1

ITEM DISTRIBUTIONS FOR EACH FACTOR ACROSS TESTS
FOURTH GRADE LEVEL

                                          IOWA    MAT    SAT    CTBS

I. Mode of Presentation
   - graphs, figures, tables, etc.
   - operation(s) specified
   - operation(s) not specified

II. Nature of Material
   - single digits
   - single and multiple digits
   - multiple digits
   - total (whole numbers)
   - single fraction
   - multiple fractions
   - decimals
   - percents
   - alter. number systems
   - place value
   - number sentences
   - algebraic sentences
   - essen. units meas.
   - geometric figures
   - other

III. Operations
   - add
   - subtract w/o borrowing
   - subtract with borrowing
   - add or subtract fractions w/o common denominator
   - multiply
   - divide w/o remainder
   - divide with remainder
   - combination
   - grouping
   - identify equivalents
   - identify rule (order)
   - identify terms

The significance of these differences becomes more apparent when
one considers that a single correct answer on the SAT math subtest can
add approximately .2 to a student's grade equivalent score, if it is
near the middle of the norm distribution. Since improvements of a
fraction of a grade level are generally considered important, it seems
likely that the differences among tests could result in score
differences that would also be considered important.

Although our analysis of the mathematics tests may be open to
criticism, it does illustrate that differences can and sometimes do
exist among the standardized tests, tests often thought to be virtually
interchangeable. Hence, as we have previously suggested, content covered
is a factor which administrators must consider when selecting
standardized tests. A school district that emphasizes work with percentages in
fourth grade would get a distorted picture of progress from the Iowa,
which contains no percentage problems. On the other hand, a district
which does not introduce percents until the sixth grade would be
unnecessarily discouraged by the results of the CTBS, on which 6% of the
problems involve percentage.
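
The size of the possible distortion can be roughed out with simple
arithmetic. The sketch below is illustrative only: the .2 grade-equivalent
gain per item is the figure quoted above for the SAT near the middle of
the norm distribution, while the 50-item test length, the function name
`ge_swing`, and the assumption that the gain per item is constant and
carries over to another test are hypothetical simplifications, not values
taken from any test manual.

```python
GE_PER_ITEM = 0.2  # from the text: approximate gain per correct SAT item near the norm median


def ge_swing(n_items, topic_share, ge_per_item=GE_PER_ITEM):
    """Return (items on the topic, grade-equivalent swing if all of them flip).

    Assumes, for illustration only, that every item is worth the same
    grade-equivalent increment.
    """
    topic_items = round(n_items * topic_share)
    return topic_items, topic_items * ge_per_item


# Hypothetical 50-item test on which 6% of the items involve percentage.
items, swing = ge_swing(n_items=50, topic_share=0.06)
print(items, round(swing, 1))  # 3 items, a swing of about 0.6 grade equivalents
```

Under these assumptions, a district that has not yet taught percents
could score roughly six-tenths of a grade level lower on such a test for
reasons unrelated to the quality of its instruction.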

Differences in supposedly equivalent tests have been found by
other authors as well. Linn and Slinde (1977), for example, found that
a change in test forms can lead to substantially different grade
equivalent scores for low achieving children. Linn (Note 4) also
discovered that even when tests have been empirically equated, the choice
of test can greatly influence the estimate of program effects. Neither
of these studies, however, attempted to explain the cause of discrepancy
in results.




Differences in test content may partially explain the discrepancy.
The content discrepancy may result, not from the test publishers'
failure to include the appropriate content, but from a general lack of
agreement on the definition of basic skills. In mathematics, for example,
considerable attention was given during the 1960s to the question of
whether or not material such as "elementary set theory" was part of the
basic skills. That question, and others like it, was never answered,
as evidenced by proceedings of a National Institute of Education (NIE)
Conference in October 1975, directed at the question, "What are basic
mathematical skills and learning?" (NIE, 1975).

It is proposed here that the determination of what mathematics 
is most worth learning is a task that will require careful and 
systematic study from the perspectives of several interest 
groups. (Helms & Grieber, 1975, p. 70)

The challenge to describe basic skills and learning in school
mathematics is an assignment full of pitfalls. In the past
five years, hundreds of mathematics educators, school systems,
professional groups and the National Assessment have been
busily composing taxonomies of fundamental objectives for
mathematics instruction at various grade levels. With few
exceptions, these efforts to establish a reasonable list of
basic skills have been failures. There has been no general
agreement among the competing groups. Moreover, the
implementation of the various lists of curriculum guidelines threatens to
produce fragmented mathematics programs that resemble occupational
training more than they resemble education in mathematical methods
and understandings likely to be of long range value. (Fey, 1975,
p. 51, emphasis added)

These educators may have exaggerated the differences of opinion concerning 
composition of the basic skills, but it cannot be assumed without 
argument that one set of skills has general sanction.

Empirical Unidimensionality

Empirical studies of the standardized tests seem to contradict the
findings of our content analysis. High internal consistency has been
reported for virtually all subtests, which would seem to indicate that the
test developers have been successful in constructing unidimensional tests.
Developers of the SAT tests report, for example, that the mathematics
subtests of concepts, computation, and applications given to beginning fifth
graders (Intermediate Level 1, Form A, 1973) have internal consistency
reliabilities of .87, .91, and .93, respectively.^5 Evidence of internal
consistency has been taken as evidence that all items measure a single
trait, and, as such, brings into question the utility of identifying
subsets of items (see, for example, Goolsby, 1966).

There are at least two reasons, however, why conclusions based upon 
evidence of empirical unidimensionality may be misleading. The first
stems from the definition of empirical unidimensionality; the second is 
a function of the ways in which unidimensionality is estimated. 

The empirical definition of unidimensionality calls for a large
first factor on the item intercorrelation matrix. Thus, empirical
unidimensionality is a static concept specific to the time of test
administration and to the population of respondents. For purposes of
illustration, consider a population of respondents and a set of items
that yield an item intercorrelation matrix with equal off-diagonal
elements. Let the respondents be beginning fourth-grade students, and
let half the items require division with remainder and half the
items require multiplication of three-digit numbers. An intervention
focused exclusively on multiplication of three-digit numbers might
uniformly reduce the difficulty of half the items. In this case, the only
effect the intervention would have on the item intercorrelation matrix
would be to create a difficulty factor. Yet, despite empirical
unidimensionality both prior to and after the intervention, it is clear
that there is a useful distinction between the two subsets of items.

^5 These figures are reported in the SAT Manual, Norms Booklet, Form A,
p. 15 (copyright 1973 by Harcourt Brace Jovanovich, Inc.).

It is necessary, therefore, to ask whether or not a test is
unidimensional relative to differences in instruction (i.e., Does all
fourth-grade mathematics instruction affect all item difficulties
equally?). Searching for differential effects across items is analogous
to searching for aptitude-by-treatment interactions (ATIs) and might be
called the "search for item-by-treatment interactions" (ITIs).

Most test data, however, are not obtained from people receiving
uniform instruction. Different students have different educational
experiences, and these experiences may have different effects across
items. If a test consists of sets of items defined by concepts such
that within each set the effect of an intervention is constant, and if
the effects of interventions vary with less than perfect correlation
across sets of items, the sets of items should be reflected in the
pattern of item intercorrelations; the intervention effects contribute
to both the covariance and variance of items within a set but not to
the covariance of items in different sets.
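
The covariance argument can be checked with a small simulation. The
sketch below is an illustration under assumptions that are not part of
the original analysis: item scores are continuous rather than right or
wrong, every student has a general ability shared by all items, each of
two content sets receives its own instructional effect, and those effects
are drawn independently across sets (the extreme case of less than
perfect correlation). All function names and parameter values are
hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_scores(n_students=2000, n_sets=2, items_per_set=4, seed=0):
    """Return one list of student scores per item, grouped set by set."""
    rng = random.Random(seed)
    scores = [[] for _ in range(n_sets * items_per_set)]
    for _ in range(n_students):
        ability = rng.gauss(0, 1)  # general factor shared by every item
        # One instructional effect per content set, independent across sets.
        effects = [rng.gauss(0, 1) for _ in range(n_sets)]
        for s in range(n_sets):
            for i in range(items_per_set):
                noise = rng.gauss(0, 1)  # item-specific error
                scores[s * items_per_set + i].append(ability + effects[s] + noise)
    return scores


def mean_within_between(scores, items_per_set):
    """Average inter-item correlation within content sets and between them."""
    within, between = [], []
    for a in range(len(scores)):
        for b in range(a + 1, len(scores)):
            r = pearson(scores[a], scores[b])
            same_set = a // items_per_set == b // items_per_set
            (within if same_set else between).append(r)
    return sum(within) / len(within), sum(between) / len(between)


scores = simulate_scores()
within_r, between_r = mean_within_between(scores, items_per_set=4)
print(round(within_r, 2), round(between_r, 2))
```

With unit variances for ability, instructional effect, and noise, the
expected correlation is 2/3 for items in the same set and 1/3 for items
in different sets. Every pair of items is positively correlated, so a
strong general factor appears, yet the content sets remain visible in the
pattern of intercorrelations.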

Estimates of internal consistency based on data from norm groups of
standardized tests seem to challenge the importance of ITIs. The apparent
unidimensionality of standardized tests, however, might only be evidence
of the existence of a strong single dimension, not of the absence of
content factors. The Spearman-Brown prophecy formula implies that the
more concepts included, the stronger the general factor, and that the
fewer items per concept, the less clearly defined the second-order
concept factors. Evidence of an internally consistent test should not be
taken as an indication that searching for ITIs in evaluations using
that test is useless.
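
The Spearman-Brown relation makes the point concrete. In its general
form, the reliability of a composite of k items with mean inter-item
correlation r is kr / (1 + (k - 1)r). The sketch below applies it to
hypothetical values: a 40-item subtest with a mean inter-item correlation
of .20, and a 4-item concept cluster with a mean within-cluster
correlation of .30. None of these figures comes from a test manual.

```python
def spearman_brown(mean_r, k):
    """Reliability of a k-item composite with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hypothetical full subtest: many items, modest average correlation.
whole_test = spearman_brown(mean_r=0.20, k=40)

# Hypothetical concept cluster: few items, somewhat higher correlation.
concept_cluster = spearman_brown(mean_r=0.30, k=4)

print(round(whole_test, 2), round(concept_cluster, 2))  # 0.91 0.63
```

Even though the items within the cluster correlate more strongly with
one another than items do across the whole test, the long test looks
highly consistent while the short cluster score is poorly defined. High
overall internal consistency is therefore compatible with real, but hard
to detect, content factors.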
Effects on Classroom
Critics of testing are concerned about the global effects of testing
programs, and the analyses above suggest that some effects specific to
the test selected may also exist. But effects of testing must also
be considered in the broader investigation of the way instructional
content is affected by factors outside the classroom.

Many believe that testing programs have some influence on the 
content teachers present. The prevailing opinion is that teachers 
are apt to "teach to the test," that is, to present content that 
closely follows the content of the test (see, for example, Cooley & 
Lohnes, 1976). Two groups consider this phenomenon undesirable: those who
believe that the tests represent only a fraction of the content that is
important, and those who believe that teachers should exercise their own
judgment in determining instructional content. On the other hand, those
who believe that the content of instruction should be uniform across
classrooms (perhaps focusing on the basic skills) may see testing programs
as a valuable tool for determining classroom instructional content.
Each group bases its assessment of the tests, in part, on the assumption
that teachers teach to the test, but neither can provide much empirical 
evidence to support that assumption. 

It may be that the institution of a testing program leads teachers
to spend more time on the material in the general area of test coverage,
but not on the specific items covered by the test. Teachers, for 
example, might pay attention to the titles of the subtests but rely on 
their own preconceived ideas about the content in those areas to 
determine what content to cover. On the other hand, they might consider 
the testing program an unwarranted imposition and pay no attention to it
at all. In short, while the assumption is plausible, it is far from certain
that teachers teach the content covered by the tests used in
their districts.

The likelihood that teachers teach to the test seems even
smaller when one considers the other factors that might influence content 
choice. An alternative influence frequently cited in the literature 
is the textbook supplied to the teacher. Schutz (Note 5) has indicated 
that while teachers may initially claim that they select content by consid- 
ering abstract goals of instruction, when pressed for detail, they often 
admit that they teach whatever material is covered in the textbook 
used by their students. Lists of objectives issued by the school 
district are another likely source of influence. The objectives are 
generally intended to influence, if not determine, content choice; hence, 
they act as a strong competitor to testing programs. It is not 
difficult to add to the list of possible influences. Teacher conceptions 
of subject matter, teacher assessment of student achievement, and 
student interest in subject matter all might affect a teacher's choice 
of content. When these alternative, partial determinants of content are 
considered, the assumption that teachers teach to the test seems less 
reasonable.

Advice to School Districts

Implicit in this discussion are two suggestions which might help^6
to guide school districts in their selection of standardized tests.
(These suggestions apply to the question of whether any testing should
be conducted only insofar as they may help administrators determine the
probable impacts of existing tests.)

^6 The recommendations offered here are general; more specific
recommendations must await the results of research currently underway at IRT
(Porter, Note 6; Floden, Note 7; Schmidt, Note 8; Freeman, Note 9).

First, the multitude of possible influences on content choice
suggests that consistency is a minimal requirement if a district is to
exercise control over instructional content in the classroom. If
district administrators want specific content presented throughout the
district, they would be well advised to ensure that the tools at their
disposal are used toward that end. If textbook selection, test selection,
and lists of objectives are all determined at the district level, then
the administrator should make choices which will bring about consistent
pressure on teachers to teach the desired content. To avoid losing
ground by, say, choosing a textbook that covers different material from
a carefully chosen test, administrators must make a thorough examination
of tests, texts, and objectives. Furthermore, they should use whatever
control they have over possible influences (such as parent pressure and
principal pressure) to make sure that each emphasizes the same desired
content.

Second, even if tests are ultimately found to have little specific
influence on instructional content, it is still important that the content
covered by the test match the content of most concern to those using the
test results. The standardized tests do not all measure achievement in
the same content areas, and the use of a test that assesses progress in an
area of little concern to the district may be misleading.

Conclusion
The effects of test selection, both on the interpretation of
results and on the content of instruction, have been oversimplified, denied,
or ignored. Such treatment seems inappropriate when one considers that
an examination of four major tests has raised serious doubts about the
assumption that the tests all measure the same achievement. If this
widely held assumption does not stand up to scrutiny, there is ample
reason to question other assumptions about the effects of test selection.

Consideration of the effects of tests on content immediately calls to
mind a broader area much in need of study: the manner in which
teachers choose instructional content. Research is needed to identify
the mechanisms by which teachers respond to the multitude of pressures
to choose instructional content. A first attempt at investigating this
area is now underway at the Institute for Research on Teaching. In two
parallel studies, the relative influences of six external factors are
being examined. The factors are: testing program, set of objectives,
textbooks, pressure from parents, pressure from teachers of higher grades,
and pressure from the principal. In one study, teachers are asked to
indicate how they think they would react to these pressures in a hypothetical
situation. In the second study, assessments are being made, in a number
of school districts, of the relationship between content covered and the
external pressures at work (Porter, Note 6; Floden, Note 7; Schmidt, Note
8; Freeman, Note 9). Both studies deal only with fourth-grade mathematics.
While these studies should provide some clues about the ways in which
teachers determine content choices, the process is surely complex and will
require prolonged study before it is well understood.

If content covered is important in terms of student achievement (and
this is generally acknowledged), then the means by which teachers choose
content, and the means by which administrators can influence that choice,
must be better understood. Indeed, increased understanding in these areas
could be the key to the much sought improvement in student performance.
Selecting tests on the basis of the content they cover may provide
immediate benefits and a starting point for the investigation
of the relationships among testing, achievement, and content
coverage.